Imagine we’re political scientists looking for common voting patterns in the mid-1980s that would predict whether a US congressperson was a Democrat or a Republican. We have the voting record of each member of the House of Representatives in 1984, and we identify 16 key votes that we believe most strongly split the two political parties. Our job is to train a naive Bayes model to predict whether a congressperson was a Democrat or a Republican, based on how they voted throughout the year.
Let’s start by exploring the data set.
library(mlr)        # makeClassifTask(), makeLearner(), train(), resample()
library(tidyverse)  # as_tibble(), map_dbl(), gather(), ggplot()
library(plotly)     # ggplotly()

data(HouseVotes84, package = 'mlbench')
votesTib <- as_tibble(HouseVotes84)
votesTib
## # A tibble: 435 × 17
## Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
## 1 repu… n y n y y y n n n y <NA> y
## 2 repu… n y n y y y n n n n n y
## 3 demo… <NA> y y <NA> y y n n n n y n
## 4 demo… n y y n <NA> y n n n n y n
## 5 demo… y y y n y y n n n n y <NA>
## 6 demo… n y y n y y n n n n n n
## 7 demo… n y n y y y n n n n n n
## 8 repu… n y n y y y n n n n n n
## 9 repu… n y n y y y n n n n n y
## 10 demo… y y y n n n y y y n n n
## # … with 425 more rows, and 4 more variables: V13 <fct>, V14 <fct>, V15 <fct>,
## # V16 <fct>
We have a tibble containing 435 cases and 17 variables describing members of the House of Representatives in 1984. The Class variable is a factor indicating political party membership, and the other 16 variables are factors indicating how each individual voted on each of the 16 votes. A value of y means they voted in favor, n means they voted against, and a missing value (NA) means the individual either abstained or did not vote. Our goal is to train a model that can use the information in these variables to predict whether a congressperson was a Democrat or a Republican, based on how they voted.
Since we have a few missing values (NAs) in our tibble, we’ll have to address them. Let’s summarise the number of missing values in each variable using the map_dbl() function. The map_dbl() function iterates over every element of a vector or list (in this case, every column of a tibble), applies a function to that element, and returns a vector containing the function’s output.
Our function passes each column to sum(is.na(.)) to count the number of missing values in that column. This function is applied to each column of the tibble and returns the number of missing values for each.
map_dbl(votesTib, ~sum(is.na(.)))
## Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
## 0 12 48 11 11 15 11 14 15 22 7 21 31
## V13 V14 V15 V16
## 25 17 28 104
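As an aside, if we wanted the same counts without purrr, base R’s colSums() applied to is.na() gives them too. A minimal sketch on a made-up toy data frame (standing in for votesTib, since the real data needs the mlbench package):

```r
# Toy data frame standing in for votesTib (invented values)
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, 7))

# Count missing values per column, the base-R way
colSums(is.na(toy))
## a b
## 1 2
```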
Every column in our tibble has missing values except the Class variable. Luckily, the naive Bayes algorithm can handle missing data in two ways:
- by omitting the cases (rows) with missing data, or
- by keeping those cases but omitting the missing variables from the likelihood calculation for each case.
By default, the naive Bayes implementation that mlr uses keeps the cases and drops the variables. This usually works fine if the ratio of missing to complete values for the majority of cases is quite small. However, if we have a smaller number of variables and a large proportion of missing values, we may wish to omit the cases instead (and, more broadly, consider whether our data set is sufficient for training).
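To make the keep-cases, drop-variables approach concrete, here is a toy base-R sketch (invented votes, not the real data) of how a per-class likelihood such as P(V1 = y | democrat) is estimated by simply skipping the missing values:

```r
# Invented mini-example: one vote, six politicians
vote  <- c("y", NA,  "n", "y", "n", "y")
party <- c("democrat",   "democrat",   "democrat",
           "republican", "republican", "republican")

# Estimate P(vote = "y" | party = "democrat"), ignoring NAs
demVotes   <- vote[party == "democrat" & !is.na(vote)]
pYGivenDem <- mean(demVotes == "y")
pYGivenDem  # 0.5: one of the two *recorded* democrat votes is "y"
```

The abstaining Democrat simply doesn’t contribute to this estimate, which is exactly why a handful of NAs per column is usually harmless.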
We’ll plot the data to get a better understanding of the relationships between political party and votes. Once again, we use the gather() function to untidy our data so we can facet across the predictors. Because we’re plotting categorical variables against each other, we set the position argument of the geom_bar() function to 'fill', which creates stacked bars for the y, n, and NA responses that sum to 1.
votesUntidy <- gather(data = votesTib, key = 'Variable', value = 'Value', -Class)

ggplotly(
  ggplot(votesUntidy, aes(Class, fill = Value)) +
    geom_bar(position = 'fill') +
    facet_wrap(~ Variable, scales = 'free_y') +
    theme_bw()
)
We can see that there are some clear differences in opinion between Democrats and Republicans.
Let’s create our task and learner, and build our model. We’ll set the Class variable as the classification target of the makeClassifTask() function, and the algorithm we supply to the makeLearner() function is "classif.naiveBayes".
votesTask <- makeClassifTask(data = votesTib, target = 'Class')
## Warning in makeTask(type = type, data = data, weights = weights, blocking =
## blocking, : Provided data is not a pure data.frame but from class tbl_df, hence
## it will be converted.
votesTask
## Supervised task: votesTib
## Type: classif
## Target: Class
## Observations: 435
## Features:
## numerics factors ordered functionals
## 0 16 0 0
## Missings: TRUE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 2
## democrat republican
## 267 168
## Positive class: democrat
bayes <- makeLearner('classif.naiveBayes')
bayes
## Learner classif.naiveBayes from package e1071
## Type: classif
## Name: Naive Bayes; Short name: nbayes
## Class: classif.naiveBayes
## Properties: twoclass,multiclass,missings,numerics,factors,prob
## Predict-Type: response
## Hyperparameters:
bayesModel <- train(bayes, votesTask)
bayesModel
## Model for learner.id=classif.naiveBayes; learner.class=classif.naiveBayes
## Trained on: task.id = votesTib; obs = 435; features = 16
## Hyperparameters:
Next, we’ll use 10-fold cross-validation repeated 50 times to evaluate the performance of our model-building procedure. Again, because this is a two-class classification problem, we have access to the false positive rate and false negative rate, so we ask for these as well in the measures argument of the resample() function.
kFold <- makeResampleDesc(method = 'RepCV', folds = 10, reps = 50,
                          stratify = TRUE)

bayesCV <- resample(learner = bayes, task = votesTask,
                    resampling = kFold,
                    measures = list(mmce, acc, fpr, fnr))
bayesCV$aggr
## mmce.test.mean acc.test.mean fpr.test.mean fnr.test.mean
## 0.09854465 0.90145535 0.08263235 0.10868091
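A quick sanity check on the output above: mmce (mean misclassification error) is simply 1 minus accuracy, since every test-set case is either classified correctly or not. The fpr and fnr values are each computed relative to one class, so they don’t combine into mmce this directly. Verifying the complement with the reported values:

```r
# Cross-validated means reported above
mmceMean <- 0.09854465
accMean  <- 0.90145535

# mmce and accuracy are complements of each other
all.equal(mmceMean + accMean, 1)  # TRUE
```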
Our model correctly predicts 90% of test set cases in our cross-validation. That’s not bad! Now, let’s use our model to predict the political party of a new politician, based on their votes.
politician <- tibble(V1 = "n", V2 = "n", V3 = "y", V4 = "n", V5 = "n",
                     V6 = "y", V7 = "y", V8 = "y", V9 = "y", V10 = "y",
                     V11 = "n", V12 = "y", V13 = "n", V14 = "n",
                     V15 = "y", V16 = "n")
politicianPred <- predict(bayesModel, newdata = politician)
getPredictionResponse(politicianPred)
## [1] democrat
## Levels: democrat republican
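Under the hood, the predicted class is the one with the largest posterior probability from Bayes’ rule: the class prior multiplied by the product of the per-vote likelihoods, then normalized. A hand-rolled sketch of that arithmetic, using the class priors from our task (267 Democrats, 168 Republicans) but two made-up likelihood values (the real model multiplies 16 per-vote conditional probabilities):

```r
# Class priors from the 1984 House data (267 democrats, 168 republicans)
prior <- c(democrat = 267 / 435, republican = 168 / 435)

# Made-up likelihoods of the observed votes under each class
likelihood <- c(democrat = 0.9 * 0.8, republican = 0.2 * 0.1)

# Bayes' rule: multiply prior by likelihood, then normalize
unnormalized <- prior * likelihood
posterior    <- unnormalized / sum(unnormalized)

names(which.max(posterior))  # "democrat" wins under these numbers
```

Note how heavily the likelihood product dominates: even a moderate per-vote difference, compounded over 16 votes, usually swamps the prior.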